The Fourier Series is given by
$$ s_N(t) = \frac{1}{2}a_0 + \sum_{n=1}^N \left[ a_n \cos(2 \pi f n t) + b_n \sin(2 \pi f n t) \right] $$where $t$ is the time, $f$ is the fundamental frequency, and $a_n$ and $b_n$ are amplitudes.
The power of the Fourier Series lies in its ability to approximate many functions by selecting appropriate coefficients $a_n$ and $b_n$. In doing so, we obtain a frequency decomposition of the original signal. The process of determining the coefficients is called a Fourier Transform (FT) and is used extensively in digital signal processing. Common uses of FTs include identifying the frequencies present in signals and filtering or selecting specific frequency ranges.
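To make the formula concrete, here is a small sketch that evaluates the partial sum $s_N(t)$ for a given set of coefficients. The helper function and the coefficient values are made up purely for illustration; they are not derived from any particular signal.
import numpy as np

def fourier_partial_sum(t, a_0, a, b, f):
    # evaluate s_N(t) = a_0/2 + sum_n [a_n cos(2 pi f n t) + b_n sin(2 pi f n t)]
    # a and b hold the coefficients for n = 1..N; f is the fundamental frequency
    n = np.arange(1, len(a) + 1)
    angles = 2. * np.pi * f * np.outer(t, n)   # one row per time sample, one column per harmonic
    return 0.5 * a_0 + np.dot(np.cos(angles), a) + np.dot(np.sin(angles), b)

# illustrative coefficients: a unit-amplitude 1 Hz sine plus a weaker second-harmonic cosine
t = np.linspace(0., 1., 100)
s = fourier_partial_sum(t, 0., np.array([0., 0.5]), np.array([1., 0.]), 1.)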
Here, I'm going to give a practical tutorial on using the fast Fourier Transform (FFT) to analyze signals. Conceptually, FFTs are not hard to use -- the difficulties lie in knowing how to interpret the results and properly convert the units.
Let's start with a simple example where we create a signal containing specific frequencies:
$$ x(t) = A \sin(2 \pi f t + \phi) $$where $A$ is the amplitude, $f$ is the frequency (given in cycles per unit time), $t$ is the time given in seconds, and $\phi$ is the phase given in radians.
We'll create 3 signals with frequencies of 1, 2, and 4 Hz, all with amplitudes of 1.
In [10]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
n_samples = 128
# time, in seconds
ts = np.linspace(0., 1., num=n_samples, endpoint=False)  # endpoint=False keeps the sample spacing at exactly 1/128 s
x_1 = np.sin(2. * np.pi * 1. * ts)
x_2 = np.sin(2. * np.pi * 2. * ts)
x_4 = np.sin(2. * np.pi * 4. * ts)
plt.plot(ts, x_1, label="1 Hz")
plt.plot(ts, x_2, label="2 Hz")
plt.plot(ts, x_4, label="4 Hz")
plt.plot(ts, x_1 + x_2 + x_4, label="Combined")
plt.legend()
plt.xlim([0., 1.25])
plt.ylim([-3., 3.])
plt.xlabel("Time (seconds)")
plt.ylabel("Amplitude")
Out[10]:
In the above example, the blue line is the 1 Hz signal -- it goes through 1 period per second. The green line is the 2 Hz signal -- it goes through 2 periods per second. The red line is the 4 Hz signal -- it goes through 4 periods per second. The cyan line is the combined signal.
We took 128 samples over 1 second, for a sampling rate of 128 Hz. According to the Nyquist theorem, to recover a signal of frequency $f$, we must sample at a rate of at least $2f$. As a result, the fastest signal we can recover from our samples is 64 Hz.
The slowest signal we can recover is set by the elapsed time covered by our samples. In our case, the samples cover 1 s, so our slowest recoverable signal is 1 Hz.
Knowing the time period (1 s) covered by the samples, the number of samples (128), and the resulting sampling frequency (128 Hz) will be important for converting the results of the FFT to the correct units.
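As a quick sanity check, each of these quantities follows directly from the sample count and the elapsed time (the variable names here simply mirror the ones used below):
elapsed_time = 1.                         # seconds covered by the samples
sampling_freq = n_samples / elapsed_time  # 128 Hz
nyquist_freq = sampling_freq / 2.         # 64 Hz: fastest recoverable frequency
freq_resolution = 1. / elapsed_time       # 1 Hz: slowest recoverable frequency and bin spacing
print sampling_freq, nyquist_freq, freq_resolution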
In [11]:
elapsed_time = 1.
freq = np.fft.fftfreq(n_samples, d=elapsed_time / n_samples)
print freq
coeff = np.fft.fft(x_1 + x_2 + x_4) / (n_samples / 2.)  # divide by N/2 so a unit-amplitude sine gives a peak of 1
print coeff
amplitudes = np.abs(coeff)
plt.plot(freq[0:n_samples/2], amplitudes[0:n_samples/2])
plt.xlabel("Frequency (Hz)", fontsize=16)
plt.ylabel("Amplitude", fontsize=16)
plt.xlim([0, 20])
Out[11]:
The FFT produces a complex vector of 128 entries that encodes the $a_n$ and $b_n$ coefficients. To get the proper amplitudes, the vector needs to be divided by half of the number of samples.
The first entry is the zero-frequency term, corresponding to the mean amplitude of the signal. Entries 1 through 63 are the coefficients for the positive frequencies, entry 64 is the Nyquist-frequency term, and the remaining 63 entries are the coefficients for the negative frequencies. Since our input signal is real, the coefficients of corresponding positive and negative frequencies are complex conjugates, meaning that we can ignore the coefficients for the negative frequencies.
To compute the amplitudes, we need to take the magnitudes of the complex numbers.
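We can verify the conjugate symmetry directly; alternatively, np.fft.rfft returns only the nonnegative-frequency half for real input, so the same coefficients can be obtained without slicing:
full = np.fft.fft(x_1 + x_2 + x_4)
half = np.fft.rfft(x_1 + x_2 + x_4)                       # nonnegative frequencies only
print np.allclose(half, full[:n_samples/2 + 1])           # rfft matches the first half of fft
print np.allclose(full[1:n_samples/2], np.conj(full[:n_samples/2:-1]))  # conjugate symmetry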
To demonstrate the use of the FFT in an applied setting, we'll analyze CPU usage for a single machine over four days. The CPU usage was sampled every 10 min, and the collected data was broken into four groups, one per day. We'll use the FFT to look for common frequencies.
In [12]:
def read_cpu_usage(flname):
    # parse lines of "<time> <usage>" and return (times, usages) as two parallel sequences
    fl = open(flname)
    cpu_usage = []
    for ln in fl:
        time_s, usage_s = ln.split()
        cpu_usage.append((time_s, float(usage_s)))
    fl.close()
    return zip(*cpu_usage)
sample_period = 10.
machine1_01 = np.array(read_cpu_usage("fft_usage_data/machine1_01_cpu.txt")[1])
machine1_02 = np.array(read_cpu_usage("fft_usage_data/machine1_02_cpu.txt")[1])
machine1_03 = np.array(read_cpu_usage("fft_usage_data/machine1_03_cpu.txt")[1])
machine1_25 = np.array(read_cpu_usage("fft_usage_data/machine1_25_cpu.txt")[1])
elapsed_time_min = 24. * 60.
sample_times = np.arange(sample_period, elapsed_time_min, sample_period)
plt.plot(sample_times, machine1_01)
plt.plot(sample_times, machine1_02)
plt.plot(sample_times, machine1_03)
plt.plot(sample_times, machine1_25)
plt.xlabel("Time (minutes)", fontsize=16)
plt.ylabel("CPU Usage(%)", fontsize=16)
plt.ylim([0.0, 20.0])
Out[12]:
I plotted the CPU usage for the machine on each of the four days. Note that the four signals overlap nearly identically for a subset of higher-frequency perturbations, while the low-frequency perturbations are quite a bit different. Using an FT, we can extract the frequencies and identify where the signals overlap.
By the Nyquist theorem, our fastest recoverable signal will have a period of 20 min. Our slowest signal will have a period of 24 hours.
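The same unit bookkeeping, written out for this data set:
nyquist_period_min = 2. * sample_period                 # 20 min: fastest recoverable period
slowest_period_min = len(machine1_01) * sample_period   # 1430 min (~24 h): span of the samples
print nyquist_period_min, slowest_period_min / 60.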
In [13]:
n_samples = 143
freq = np.fft.fftfreq(n_samples, d=sample_period)
print freq
coeff1_01 = np.fft.fft(machine1_01) / (n_samples / 2.0)
amplitudes1_01 = np.abs(coeff1_01)
coeff1_02 = np.fft.fft(machine1_02) / (n_samples / 2.0)
amplitudes1_02 = np.abs(coeff1_02)
coeff1_03 = np.fft.fft(machine1_03) / (n_samples / 2.0)
amplitudes1_03 = np.abs(coeff1_03)
coeff1_25 = np.fft.fft(machine1_25) / (n_samples / 2.0)
amplitudes1_25 = np.abs(coeff1_25)
plt.plot(freq[:(n_samples+1)/2], amplitudes1_01[:(n_samples+1)/2])
plt.plot(freq[:(n_samples+1)/2], amplitudes1_02[:(n_samples+1)/2])
plt.plot(freq[:(n_samples+1)/2], amplitudes1_03[:(n_samples+1)/2])
plt.plot(freq[:(n_samples+1)/2], amplitudes1_25[:(n_samples+1)/2])
plt.plot([0.015, 0.015], [0, 3.5], "k-")
plt.plot([0.019, 0.019], [0, 3.5], "k-")
plt.plot([0.03, 0.03], [0, 3.5], "k-")
plt.plot([0.0375, 0.0375], [0, 3.5], "k-")
plt.xlabel("Frequency (cycles/min)", fontsize=16)
plt.ylabel("Amplitude", fontsize=16)
plt.xlim([0, max(freq)])
Out[13]:
The frequency analysis shows the highest levels of similarity in the ranges of 0.015 - 0.019 cycles/min and 0.03 - 0.0375 cycles/min, demarcated by the black vertical lines. These frequencies correspond to periods of approximately 53 - 67 min and 27 - 33 min, respectively. Overall, the signals are qualitatively similar at frequencies above 0.015 cycles/min, or periods below 67 min.
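The conversion between the marked frequencies and the quoted periods is just a reciprocal:
# periods (min) corresponding to the demarcated frequencies (cycles/min)
for f in [0.015, 0.019, 0.03, 0.0375]:
    print f, 1. / f      # 66.7, 52.6, 33.3, 26.7 minutes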
Let's look at a second machine. In this case, the data was sampled every 5 min over three days.
In [14]:
sample_period = 5.
machine2_01 = np.array(read_cpu_usage("fft_usage_data/machine2_01_cpu.txt")[1])
machine2_02 = np.array(read_cpu_usage("fft_usage_data/machine2_02_cpu.txt")[1])
machine2_03 = np.array(read_cpu_usage("fft_usage_data/machine2_03_cpu.txt")[1])
elapsed_time_min = 24. * 60.
sample_times = np.arange(sample_period, elapsed_time_min - sample_period, sample_period)
plt.plot(sample_times, machine2_01)
plt.plot(sample_times, machine2_02)
plt.plot(sample_times, machine2_03)
plt.xlabel("Time (minutes)", fontsize=16)
plt.ylabel("CPU Usage(%)", fontsize=16)
plt.ylim([0.0, 100.0])
Out[14]:
I plotted the CPU usage for the machine on each of the three days. Note that the three signals overlap nearly identically -- they are even more similar than those of the first machine. Again, we'll use an FT to extract the frequencies and identify where the signals overlap.
By the Nyquist theorem, our fastest recoverable signal will have a period of 10 min. Our slowest signal will have a period of 24 hours.
In [15]:
sample_period = 5.
n_samples = 286
freq = np.fft.fftfreq(n_samples, d=sample_period)
print freq
coeff2_01 = np.fft.fft(machine2_01) / (n_samples / 2.0)
amplitudes2_01 = np.abs(coeff2_01)
coeff2_02 = np.fft.fft(machine2_02) / (n_samples / 2.0)
amplitudes2_02 = np.abs(coeff2_02)
coeff2_03 = np.fft.fft(machine2_03) / (n_samples / 2.0)
amplitudes2_03 = np.abs(coeff2_03)
plt.plot(freq[:(n_samples+1)/2], amplitudes2_01[:(n_samples+1)/2])
plt.plot(freq[:(n_samples+1)/2], amplitudes2_02[:(n_samples+1)/2])
plt.plot(freq[:(n_samples+1)/2], amplitudes2_03[:(n_samples+1)/2])
plt.xlabel("Frequency (cycles/min)", fontsize=16)
plt.ylabel("Amplitude", fontsize=16)
plt.xlim([0, max(freq)])
Out[15]:
As with the CPU usage plots, the frequency plots for the second machine are nearly identical at all frequencies.
What if we compare the two machines?
In [16]:
elapsed_time_min = 24. * 60.
plt.plot(np.arange(10., elapsed_time_min, 10.), machine1_01, label="M1")
plt.plot(np.arange(5., elapsed_time_min - 5., 5.), machine2_01, label="M2")
plt.xlabel("Time (minutes)", fontsize=16)
plt.ylabel("CPU Usage(%)", fontsize=16)
plt.ylim([0.0, 100.0])
plt.legend()
Out[16]:
In [21]:
freq1 = np.fft.fftfreq(143, d=10.)
freq2 = np.fft.fftfreq(286, d=5.)
coeff1_01 = np.fft.fft(machine1_01) / (143. / 2.0)
amplitudes1_01 = np.abs(coeff1_01)
coeff2_01 = np.fft.fft(machine2_01) / (286. / 2.0)
amplitudes2_01 = np.abs(coeff2_01)
# since machine2 has more samples over the same time span, its FFT can resolve
# frequencies twice as high as machine1's. we drop the upper half of machine2's
# spectrum to make the two comparable
plt.plot(freq1[:143/2], amplitudes1_01[:143/2], label="M1")
plt.plot(freq2[:143/2], amplitudes2_01[:143/2], label="M2")
plt.xlabel("Frequency (cycles/min)", fontsize=16)
plt.ylabel("Amplitude", fontsize=16)
plt.xlim([0, max(freq1)])
plt.legend()
Out[21]:
The FFTs of the usage from the two machines vary quite a bit qualitatively. First off, the amplitudes are very different due to the heavy CPU usage of machine 2. Machine 2 also had signals present across a wider range of frequencies.
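One way to push this beyond a visual comparison is to use the fact that both FFTs share the same frequency-bin spacing (1/1430 cycles/min), so the overlapping low-frequency bins line up element-wise. This is just a sketch of the idea, not something I'd read too much into:
common_bins = 143 / 2                                    # bins both machines share (0 to ~0.05 cycles/min)
diff = amplitudes2_01[:common_bins] - amplitudes1_01[:common_bins]
print np.max(np.abs(diff)), np.mean(np.abs(diff))        # largest and average amplitude gap per bin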
In [17]: